This is the simplest way to pass data to the indexer. It was created to work around a limitation of xmlpipe2: with XML, indexer has to map each attribute and field tag in the file to the corresponding schema element, and the cost of this mapping grows with the number of fields and attributes in the schema. tsvpipe has no such issue, because each field and attribute corresponds to a fixed column in the TSV file, so in some cases tsvpipe can work slightly faster than xmlpipe2.
The first column in a TSV/CSV file must be the document ID. The remaining columns must mirror the declaration of fields and attributes in the schema definition.
The difference between tsvpipe and csvpipe is the delimiter and the quoting rules: tsvpipe uses the tab character as a hardcoded delimiter and has no quoting rules, while csvpipe has the csvpipe_delimiter option for the delimiter (default ','; a custom-delimiter sketch follows the list) and the following quoting rules:
- any field may be quoted
- fields containing a line break, a double quote or a comma should be quoted
- a double quote character in a field must be represented by two double quote characters
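For instance, a hypothetical source reading a semicolon-delimited file could override the delimiter like this (a sketch; the source name, file path and column are made up):
source csv_semicolon
{
type = csvpipe
csvpipe_command = cat /tmp/products.csv
# assumed example: use ';' instead of the default ','
csvpipe_delimiter = ;
csvpipe_field = title
}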
The tsvpipe_command directive is mandatory and contains the shell command invoked to produce the TSV stream that gets indexed. The command can simply read a TSV file, but it can also be a program that generates the tab-delimited content on the fly.
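For example, instead of reading a pre-built file, the command could query a database directly; mysql in --batch mode prints tab-separated rows (a sketch; the connection details, table and columns are assumptions):
tsvpipe_command = mysql -h localhost -u user mydb \
	--batch --skip-column-names \
	-e "SELECT id, name, genre_tags FROM bands"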
The following directives can be used to declare the types of the indexed columns:
- tsvpipe_field - declares a text field
- tsvpipe_field_string - declares a text field/string attribute; the column will be indexed as a text field and also stored as a string attribute
- tsvpipe_attr_uint - declares an integer attribute
- tsvpipe_attr_timestamp - declares a timestamp attribute
- tsvpipe_attr_bool - declares a boolean attribute
- tsvpipe_attr_float - declares a float attribute
- tsvpipe_attr_bigint - declares a big integer attribute
- tsvpipe_attr_multi - declares a multi-value attribute with integers
- tsvpipe_attr_multi_64 - declares a multi-value attribute with 64-bit integers
- tsvpipe_attr_string - declares a string attribute
- tsvpipe_attr_json - declares a JSON attribute
Example of a source using a TSV file
source tsv_test
{
type = tsvpipe
tsvpipe_command = cat /tmp/rock_bands.tsv
tsvpipe_field = name
tsvpipe_attr_multi = genre_tags
}
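The /tmp/rock_bands.tsv file referenced by tsvpipe_command then contains one tab-separated row per document: the document ID, the name field and the genre_tags values: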
1 Led Zeppelin 35,23,16
2 Deep Purple 35,92
3 Frank Zappa 35,23,16,92,33,24
The csvpipe_command directive is mandatory and contains the shell command invoked to produce the CSV stream that gets indexed. The command can simply read a CSV file, but it can also be a program that generates the comma-delimited content on the fly.
The following directives can be used to declare the types of the indexed columns:
- csvpipe_field - declares a text field
- csvpipe_field_string - declares a text field/string attribute; the column will be indexed as a text field and also stored as a string attribute
- csvpipe_attr_uint - declares an integer attribute
- csvpipe_attr_timestamp - declares a timestamp attribute
- csvpipe_attr_bool - declares a boolean attribute
- csvpipe_attr_float - declares a float attribute
- csvpipe_attr_bigint - declares a big integer attribute
- csvpipe_attr_multi - declares a multi-value attribute with integers
- csvpipe_attr_multi_64 - declares a multi-value attribute with 64-bit integers
- csvpipe_attr_string - declares a string attribute
- csvpipe_attr_json - declares a JSON attribute
Example of a source using a CSV file
source csv_test
{
type = csvpipe
csvpipe_command = cat /tmp/rock_bands.csv
csvpipe_field = name
csvpipe_attr_multi = genre_tags
}
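The matching /tmp/rock_bands.csv uses the same columns; the fields are quoted because the multi-value column itself contains commas: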
1,"Led Zeppelin","35,23,16"
2,"Deep Purple","35,92"
3,"Frank Zappa","35,23,16,92,33,24"
There's a frequent situation where the total dataset is too big to be reindexed from scratch often, but the number of new records is rather small. Example: a forum with 1,000,000 archived posts, but only 1,000 new posts per day.
In this case, "live" (almost real-time) index updates can be implemented using the so-called "main+delta" scheme.
The idea is to set up two sources and two indexes: one "main" index for the data which changes rarely (if ever), and one "delta" index for the new documents. In the example above, the 1,000,000 archived posts would go to the main index, and the 1,000 newly inserted posts per day would go to the delta index. The delta index can then be reindexed very frequently, making the documents available to search within minutes. Specifying which documents should go to which index, and reindexing the main index, can also be made fully automatic. One option is to have a counter table that tracks the ID splitting the documents, and to update it whenever the main index is reindexed.
Example: Fully automated live updates
# in MySQL
CREATE TABLE sph_counter
(
counter_id INTEGER PRIMARY KEY NOT NULL,
max_doc_id INTEGER NOT NULL
);
# in sphinx.conf
source main
{
# ...
sql_query_pre = SET NAMES utf8
sql_query_pre = REPLACE INTO sph_counter SELECT 1, MAX(id) FROM documents
sql_query = SELECT id, title, body FROM documents \
WHERE id<=( SELECT max_doc_id FROM sph_counter WHERE counter_id=1 )
}
source delta : main
{
sql_query_pre = SET NAMES utf8
sql_query = SELECT id, title, body FROM documents \
WHERE id>( SELECT max_doc_id FROM sph_counter WHERE counter_id=1 )
}
index main
{
source = main
path = /path/to/main
# ... all the other settings
}
**Note how in the delta index below all the other settings are copied from main, while source and path are overridden (they MUST be overridden).**
index delta : main
{
source = delta
path = /path/to/delta
}
A better split variable is a timestamp column instead of the ID, as timestamps can track not just new documents, but also modified ones.
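A minimal sketch of such a timestamp-based split, assuming the documents table has an updated_at column (this column name is an assumption) and reusing sph_counter so that max_doc_id stores the last indexed UNIX timestamp instead of the maximum ID:
# in sphinx.conf
source main
{
# ...
sql_query_pre = SET NAMES utf8
# store the newest indexed timestamp (reusing the max_doc_id column)
sql_query_pre = REPLACE INTO sph_counter SELECT 1, UNIX_TIMESTAMP(MAX(updated_at)) FROM documents
sql_query = SELECT id, title, body FROM documents \
    WHERE UNIX_TIMESTAMP(updated_at)<=( SELECT max_doc_id FROM sph_counter WHERE counter_id=1 )
}

source delta : main
{
sql_query_pre = SET NAMES utf8
sql_query = SELECT id, title, body FROM documents \
    WHERE UNIX_TIMESTAMP(updated_at)>( SELECT max_doc_id FROM sph_counter WHERE counter_id=1 )
}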
For datasets where documents can be modified or deleted, the delta index should also provide a list of the documents that changed, so that they can be suppressed and excluded from search queries. This is achieved with the feature called kill lists. The document IDs to be killed can be provided in an auxiliary query defined by sql_query_killlist. The delta index must point to the indexes to which the kill-lists will be applied via the killlist_target directive. The effect of kill-lists on the target index is permanent: even if a search is made without the delta index, the suppressed documents will not appear in the results.
Note how we override sql_query_pre in the delta source. The override has to be explicit; otherwise the REPLACE query would also run when indexing the delta source, effectively nullifying it. However, the first time a directive is issued in the inherited source, it removes all inherited values, so the encoding setup is lost as well. Therefore sql_query_pre in the delta cannot simply be empty, and we need to issue the encoding setup query explicitly once again.
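A sketch of wiring the kill-list into the same scheme, assuming a deleted_documents table that records the IDs of removed or modified posts (the table name is an assumption):
# in sphinx.conf
source delta : main
{
sql_query_pre = SET NAMES utf8
sql_query = SELECT id, title, body FROM documents \
    WHERE id>( SELECT max_doc_id FROM sph_counter WHERE counter_id=1 )
# IDs returned here form the delta's kill-list
sql_query_killlist = SELECT id FROM deleted_documents
}

index delta : main
{
source = delta
path = /path/to/delta
# suppress documents from this kill-list in the 'main' index
killlist_target = main
}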
Adding data from indexes
Merging two existing plain indexes can be more efficient than indexing the data from scratch, and is desirable in some cases (such as merging the 'main' and 'delta' indexes instead of simply reindexing 'main' in the 'main+delta' partitioning scheme), so indexer has an option to do that. Merging indexes is normally faster than reindexing, but still not instant on huge indexes: basically, it needs to read the contents of both indexes once and write the result once. Merging a 100 GB and a 1 GB index, for example, will result in 202 GB of I/O (roughly 101 GB read plus 101 GB written), but that is still likely less than indexing from scratch would require.
The basic command syntax is as follows:
sudo -u manticore indexer --merge DSTINDEX SRCINDEX [--rotate]
Only the DSTINDEX index will be affected: the contents of SRCINDEX will be merged into it. The --rotate switch is required if DSTINDEX is already being served by searchd. The originally intended usage pattern is to merge a smaller update from SRCINDEX into DSTINDEX, so when merging attributes, the values from SRCINDEX win if duplicate document IDs are encountered. Note, however, that the "old" keywords will not be automatically removed in such cases. For example, if there's a keyword "old" associated with document 123 in DSTINDEX, and a keyword "new" associated with it in SRCINDEX, document 123 will be found by both keywords after the merge. You can supply an explicit condition to remove documents from DSTINDEX to mitigate that; the relevant switch is --merge-dst-range:
sudo -u manticore indexer --merge main delta --merge-dst-range deleted 0 0
This switch lets you apply filters to the destination index along with merging. There can be several filters; all of their conditions must be met in order to include the document in the resulting merged index. In the example above, the filter passes only those records where 'deleted' is 0, eliminating all records that were flagged as deleted.
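For the filter above to work, DSTINDEX must actually carry a 'deleted' attribute. A sketch of how it could be declared in the source (the deleted column itself is an assumption):
source main
{
# ...
sql_query = SELECT id, title, body, deleted FROM documents
# integer attribute referenced by --merge-dst-range above
sql_attr_uint = deleted
}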